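Day 23 | Keeping the service up: health checks, monitoring, and the stability trio (FastAPI × LINE SDK v3)
Goal: stop relying on luck once the service is live. Today we add the following to the project:
1. Health checks (liveness / readiness)
2. Structured logging + Request ID correlation
3. The stability trio: timeouts, retries, circuit breaking / throttling
4. (Optional) Prometheus metrics, for an at-a-glance view of traffic and errors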
⸻
Liveness tells the platform "I am still alive"; readiness says "I can take traffic now".
The same pattern works on Render, K8s, or a plain cloud VM.
# health.py
import os

import httpx
from fastapi import APIRouter
from fastapi.responses import PlainTextResponse

router = APIRouter()


@router.get("/healthz")
async def healthz():
    # Liveness: cheap check, the process being up is enough.
    return PlainTextResponse("ok", status_code=200)


@router.get("/readyz")
async def readyz():
    # Readiness: check key dependencies (env vars, external reachability).
    # Keep it fast and always set a timeout.
    required_envs = ["CHANNEL_ACCESS_TOKEN", "CHANNEL_SECRET"]
    missing = [k for k in required_envs if not os.getenv(k)]
    if missing:
        return PlainTextResponse(f"env missing: {','.join(missing)}", status_code=503)
    try:
        # Lightweight reachability probe against the LINE API host. No need to
        # call an authenticated API; a public endpoint or a DNS lookup also works.
        # Only connection-level failures raise here; the HTTP status is ignored.
        async with httpx.AsyncClient(timeout=2) as c:
            await c.get("https://api.line.me/")
    except Exception as e:
        return PlainTextResponse(f"dep: line_api {e}", status_code=503)
    return PlainTextResponse("ready", status_code=200)
In app_fastapi.py:
from health import router as health_router

app.include_router(health_router)
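Benefits
• The platform can automatically restart a hung instance (healthz failing).
• During a deploy, an instance receives traffic only once readiness is OK, so events are not dropped during cold start.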
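To make the readiness gate concrete, here is a minimal sketch of what a deploy script or platform does with /readyz: poll until it reports OK, and only then admit traffic. The names `wait_until_ready` and `probe` and the timing parameters are hypothetical, chosen for illustration; real platforms such as Render or K8s do this for you through their health-check settings.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], int],
    timeout: float = 30.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns HTTP 200 or `timeout` seconds pass.

    `probe` is a hypothetical callable that returns the status code of
    GET /readyz (for example, a thin wrapper around an HTTP client).
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe() == 200:
                return True
        except Exception:
            # A connection error during cold start just means "not ready yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Simulate an instance that answers 503 twice before becoming ready.
    statuses = iter([503, 503, 200])
    print(wait_until_ready(lambda: next(statuses), timeout=5, interval=0))
```

The clock and sleep functions are injectable only so the loop can be exercised without real waiting.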
⸻
Turn logs into searchable JSON, and use a Request ID to tie together the whole call chain (Webhook → your handler → external API).
2.1 Middleware that adds a Request ID
# logging_setup.py
import json
import logging
import uuid
from contextvars import ContextVar
from typing import Callable

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response

# Holds the current request's ID; "-" when logging outside a request.
request_id_ctx: ContextVar[str] = ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            "logger": record.name,
            "request_id": request_id_ctx.get("-"),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def setup_json_logging():
    # Replace the root logger's handlers with a single JSON stream handler.
    h = logging.StreamHandler()
    h.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = []
    root.addHandler(h)
    root.setLevel(logging.INFO)


class RequestIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next: Callable):
        # Reuse the caller's ID if present, otherwise generate one.
        rid = request.headers.get("x-request-id", str(uuid.uuid4()))
        token = request_id_ctx.set(rid)
        try:
            resp: Response = await call_next(request)
            resp.headers["x-request-id"] = rid
            return resp
        finally:
            request_id_ctx.reset(token)
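Enable it:
from logging_setup import setup_json_logging, RequestIdMiddleware

setup_json_logging()
app.add_middleware(RequestIdMiddleware)
Use it:
import logging

# Use an application logger that propagates to the root logger ("app" is just
# an example name). Under uvicorn's default logging config, "uvicorn.error" is
# handled by uvicorn's own handler and would bypass the JSON formatter.
log = logging.getLogger("app")
log.info("handle text event")  # request_id is attached automatically
When debugging, search the log panel for a single request_id to see the whole flow for that request.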
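Why a ContextVar rather than a module-level global: each asyncio task works on its own copy of the context, so concurrent requests do not overwrite each other's ID. The following stdlib-only sketch illustrates that behaviour; `handle` and the sample IDs are made up for the demonstration and are not part of the project.

```python
import asyncio
from contextvars import ContextVar

request_id_ctx: ContextVar[str] = ContextVar("request_id", default="-")


async def handle(rid: str) -> str:
    """Pretend request handler: set the ID, yield to other tasks, read it back."""
    token = request_id_ctx.set(rid)
    try:
        # Give the event loop a chance to interleave the other "requests".
        await asyncio.sleep(0)
        return request_id_ctx.get()
    finally:
        request_id_ctx.reset(token)


async def main() -> list:
    # gather() wraps each coroutine in its own task, each with a copied context.
    return await asyncio.gather(handle("req-a"), handle("req-b"), handle("req-c"))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each handler reads back the ID it set itself, even though the three run interleaved on one event loop.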
⸻
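3.1 A lightweight, generic httpx wrapper (timeout + retry)
# http_helpers.py
import asyncio
import random

import httpx


async def fetch_json(url: str, headers=None, timeout=4.0, retries=2, backoff=(0.2, 0.8)):
    """GET `url` and return the parsed JSON, retrying up to `retries` times."""
    last = None
    # Reuse one client (and its connection pool) across attempts.
    async with httpx.AsyncClient(timeout=timeout) as c:
        for i in range(retries + 1):
            try:
                r = await c.get(url, headers=headers)
                r.raise_for_status()
                return r.json()
            except Exception as e:
                # Note: this also retries 4xx responses; narrow the except
                # clause if that is not what you want.
                last = e
                if i < retries:
                    # Random delay between attempts to avoid synchronized retries.
                    await asyncio.sleep(random.uniform(*backoff))
    raise last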
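Usage (for example, an exchange-rate API):
from http_helpers import fetch_json


async def get_twd_per(target="JPY"):
    # The response is based on `target`, so rates["TWD"] is TWD per 1 unit of target.
    data = await fetch_json(f"https://open.er-api.com/v6/latest/{target}")
    return data["rates"]["TWD"]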
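The wrapper above sleeps for a uniformly random 0.2 to 0.8 seconds between attempts. A common alternative, which is not what the original code does, is exponential backoff with full jitter: the upper bound of the delay doubles with each failed attempt. A minimal sketch follows; `backoff_delay`, `base`, and `cap` are illustrative names.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 0.2,
    cap: float = 5.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a sleep time in seconds for the given 0-based retry attempt.

    The upper bound doubles each attempt (base, 2*base, 4*base, ...) up to
    `cap`; the actual delay is drawn uniformly from [0, upper bound].
    """
    source = rng if rng is not None else random
    upper = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, upper)


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, round(backoff_delay(attempt), 3))
```

To try it in fetch_json, the call `random.uniform(*backoff)` would become `backoff_delay(i)`.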
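3.2 Circuit breaking & throttling (bare-bones version)
This keeps you from hammering an external service that is already down, and keeps a single user from flooding a group.
# circuit.py
import time

FAIL_MAX = 5    # consecutive failures before the circuit opens
COOL_DOWN = 30  # seconds the circuit stays open
_state = {"fails": 0, "until": 0}


def can_call() -> bool:
    # True when the circuit is closed or the cool-down has expired.
    return time.time() >= _state["until"]


def record(success: bool):
    if success:
        _state["fails"] = 0
        _state["until"] = 0
    else:
        _state["fails"] += 1
        if _state["fails"] >= FAIL_MAX:
            _state["until"] = time.time() + COOL_DOWN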
At the call site:
from circuit import can_call, record

# Inside your handler function:
if not can_call():
    return "The external service is busy, please try again later 🙏"
try:
    # do external call...
    record(True)
except Exception:
    record(False)
    raise
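The module-level `_state` above tracks a single dependency. If several external services are involved, one option is to wrap the same logic in a small class so each service gets its own breaker. This is a sketch under that assumption; the class name `CircuitBreaker` and the injectable `clock` parameter are additions for illustration and testability, not part of the original module.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Open after `fail_max` consecutive failures, for `cool_down` seconds."""

    def __init__(
        self,
        fail_max: int = 5,
        cool_down: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.fail_max = fail_max
        self.cool_down = cool_down
        self._clock = clock
        self._fails = 0
        self._open_until = 0.0

    def can_call(self) -> bool:
        # Closed, or the cool-down has elapsed and a trial call is allowed.
        return self._clock() >= self._open_until

    def record(self, success: bool) -> None:
        if success:
            self._fails = 0
            self._open_until = 0.0
            return
        self._fails += 1
        if self._fails >= self.fail_max:
            # The failure count is kept, so one failed trial call re-opens at once.
            self._open_until = self._clock() + self.cool_down


if __name__ == "__main__":
    rates_breaker = CircuitBreaker(fail_max=2, cool_down=10)
    rates_breaker.record(False)
    rates_breaker.record(False)
    print(rates_breaker.can_call())
```

Usage mirrors the functions above: check `can_call()` before the external call, then `record(True)` or `record(False)` afterwards.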
Per-user throttling (a simple per-chat rate limit):
# throttle.py
import time
from collections import deque

WINDOW = 5   # seconds
MAX_REQ = 8  # tune as needed
_buckets = {}  # chat_id -> deque of request timestamps


def allow(chat_id: str) -> bool:
    now = time.time()
    q = _buckets.setdefault(chat_id, deque())
    # Drop timestamps that have left the window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= MAX_REQ:
        return False
    q.append(now)
    return True
At the very start of the text-event handler:
from throttle import allow

if not allow(chat_id):
    await reply("Hang on a sec, let me catch my breath 😮‍💨")
    return
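One thing the simple limiter does not handle: `_buckets` keeps an entry for every chat it has ever seen, so it grows for the life of the process. A possible housekeeping step, sketched below, drops chats whose newest timestamp has already left the window. The function name `sweep` and the idea of calling it periodically are assumptions, not part of the original module.

```python
import time
from typing import Dict, Optional, Sequence

WINDOW = 5  # seconds, same value as the limiter above


def sweep(
    buckets: Dict[str, Sequence[float]],
    window: float = WINDOW,
    now: Optional[float] = None,
) -> int:
    """Delete idle chats from `buckets` in place; return how many were removed."""
    now = time.time() if now is None else now
    idle = [
        chat_id
        for chat_id, q in buckets.items()
        # Empty bucket, or the most recent request is older than the window.
        if not q or now - q[-1] > window
    ]
    for chat_id in idle:
        del buckets[chat_id]
    return len(idle)


if __name__ == "__main__":
    demo = {"chat-a": [1.0, 2.0], "chat-b": [9.0]}
    print(sweep(demo, window=5, now=10.0), list(demo))
```

It could be run from a background task every few minutes, or every N calls to `allow`.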
⸻
Handy when you want coarse-grained monitoring: QPS, error rate, and which features are popular.
# metrics.py
import time
from contextlib import contextmanager

from fastapi import APIRouter
from fastapi.responses import Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

router = APIRouter()
EVENTS_TOTAL = Counter("events_total", "Total LINE events", ["type"])
ERRORS_TOTAL = Counter("errors_total", "Total errors", ["where"])
LATENCY = Histogram("handler_latency_seconds", "Handler latency", ["name"])


@router.get("/metrics")
def metrics():
    # Expose all registered metrics in the Prometheus text format.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)


@contextmanager
def observe(name: str):
    # Record how long the wrapped block takes, even if it raises.
    start = time.perf_counter()
    try:
        yield
    finally:
        LATENCY.labels(name=name).observe(time.perf_counter() - start)
In the event handler:
from metrics import router as metrics_router, EVENTS_TOTAL, ERRORS_TOTAL, observe

app.include_router(metrics_router)

# Inside the text-event handler:
EVENTS_TOTAL.labels(type="text").inc()
with observe("text_handler"):
    ...  # your handling logic
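`ERRORS_TOTAL` is imported above but the snippet never increments it. One way to wire it in, sketched here without importing prometheus_client so that it runs on its own, is a context manager that takes two callbacks: one to record latency and one to count a failure. The names `track`, `on_latency`, and `on_error` are hypothetical; in the project you would pass `LATENCY.labels(name="text_handler").observe` and `ERRORS_TOTAL.labels(where="text_handler").inc`.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def track(
    on_latency: Callable[[float], None],
    on_error: Callable[[], None],
) -> Iterator[None]:
    """Time the wrapped block; count an error if it raises, then re-raise."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        on_error()
        raise
    finally:
        # Latency is recorded for both successful and failed runs.
        on_latency(time.perf_counter() - start)


if __name__ == "__main__":
    latencies: list = []
    errors: list = []
    try:
        with track(latencies.append, lambda: errors.append(1)):
            raise ValueError("boom")
    except ValueError:
        pass
    print(len(latencies), len(errors))
```

Keeping the metric objects out of the helper means the same wrapper can be reused for any handler by passing differently labelled callbacks.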
⸻
import logging

from fastapi import Request
from fastapi.responses import JSONResponse

log = logging.getLogger("uvicorn.error")


# Catch-all handler: log the unhandled exception with its traceback and
# return a generic 500 so internal details are not leaked to the caller.
@app.exception_handler(Exception)
async def global_exc(request: Request, exc: Exception):
    log.error("unhandled", exc_info=exc)
    return JSONResponse({"error": "internal error"}, status_code=500)
⸻
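Summary
Today we filled in observability and resilience in one go:
• /healthz and /readyz make deployment and automatic recovery more reliable
• JSON logs + Request ID make troubleshooting efficient
• Timeouts + retries + circuit breaking / throttling hold up when external services are flaky
• Prometheus metrics give a quick read on spikes, errors, and popular features
The project now has the basics expected of a production-grade service.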